Machine-generated data (MGD) is the generic term for information that is automatically created by a computer process, application, or other machine without human intervention. There is, however, some disagreement about the breadth of the term. Curt Monash of Monash Research, who is generally credited with introducing the term, defines it as "data that was produced entirely by machines OR data that is more about observing humans than recording their choices."[1] Daniel Abadi, a computer science professor at Yale, proposes a narrower definition: "Machine-generated data is data that is generated as a result of a decision of an independent computational agent or a measurement of an event that is not caused by a human action."[2] Despite this difference, both definitions exclude data manually entered by an end user.[3] Machine-generated data spans all industry sectors, and humans increasingly generate it unknowingly.[4]
Machine-generated data tends to be amorphous, and users typically never modify it. Machines often generate this data as a consistent response to an event that has occurred. Because the event is historical, the data is less prone to updates and modifications. Partly because of this quality, U.S. courts consider machine-generated data highly reliable.[5]
In 2009, Gartner projected that data would grow by 650% over the following five years.[6] Most of this growth comes from machine-generated data.[3]
Because machine-generated data is fairly static yet voluminous, data owners rely on highly scalable tools to process and analyze the resulting datasets. Almost all machine-generated data is unstructured at the source but is then converted into a common structure.[3] These derived structures typically contain many data points, or columns, and the main challenge lies in analyzing them. Given high performance requirements and large data sizes, traditional database indexing and partitioning limit the size and history of the dataset that can be processed. Columnar databases offer an alternative approach, because a particular analysis accesses only particular "columns" of the dataset.[7]
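The column-oriented idea can be illustrated with a minimal Python sketch. It is not how any particular columnar database is implemented. The event records, the field names (`ts`, `host`, `status`, `latency_ms`), and the helper functions are all hypothetical, and the sketch assumes every record carries the same fields.

```python
from typing import Any

# Hypothetical machine-generated events, stored row by row (one dict per event).
ROWS: list[dict[str, Any]] = [
    {"ts": 1, "host": "a", "status": 200, "latency_ms": 12},
    {"ts": 2, "host": "b", "status": 500, "latency_ms": 30},
    {"ts": 3, "host": "a", "status": 200, "latency_ms": 18},
]


def to_columns(rows: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Pivot row-oriented records into one list per column.

    Assumes every row has the same keys as the first row.
    """
    if not rows:
        return {}
    names = list(rows[0])
    return {name: [row[name] for row in rows] for name in names}


def mean_latency(columns: dict[str, list[Any]]) -> float:
    """Average one column; no other column is read."""
    values = columns["latency_ms"]
    return sum(values) / len(values)


if __name__ == "__main__":
    columns = to_columns(ROWS)
    print(mean_latency(columns))
```

In the row layout, computing the average latency means visiting every full record. In the column layout, the same analysis reads a single list, which is the property the paragraph above attributes to columnar databases.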